Add support for using payloads to boost terms #3772

Closed
bdurand opened this issue Sep 24, 2013 · 5 comments

bdurand commented Sep 24, 2013

It would be great to be able to have a mapping field that stores payloads with terms, and to use those payloads to boost the score of the document.

In my particular use case, I have documents which are tagged by users and after running through filters and algorithms we can determine which tags are most likely useful and which are likely spam. We'd like to pass that information on to the search index so that we can boost the documents we think are most appropriate to the search terms. In this case the boost is known at indexing time and applies to the terms themselves and not to the field or the documents.

This has been possible with Lucene for quite a while, and Solr had partial support for it but never fully implemented it out of the box (see, for example, http://wiki.apache.org/solr/Payloads, http://searchhub.org/2009/08/05/getting-started-with-payloads/, http://hnagtech.wordpress.com/2013/04/19/using-payloads-with-solr-4-x/).

Ideally, it would be best to pass the payload in as a separate JSON field value in the document. The Solr tokenizer for payloads (DelimitedPayloadTokenFilterFactory) uses a delimiter, but I've found this to be problematic when dealing with user generated terms. In addition, it would be best to have the payload value somehow available in scripting so the payloads can be indexed once and then the scoring algorithms tweaked as necessary to get the right scores.
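
To illustrate the delimiter concern, here is a minimal Python sketch. It is not Solr or Lucene code: `encode_delimited` and `decode_delimited` are hypothetical helpers that mimic a "term, delimiter, payload" convention with an assumed `|` delimiter and an assumed split-at-first-delimiter rule. It shows how a user-generated tag that itself contains the delimiter is misparsed, while passing term and payload as separate JSON values is not affected.

```python
import json

DELIMITER = "|"  # assumed delimiter for this sketch


def encode_delimited(term, payload):
    """Join a term and its payload into one delimited token."""
    return f"{term}{DELIMITER}{payload}"


def decode_delimited(token):
    """Split at the first delimiter (an assumption of this sketch)."""
    term, sep, raw = token.partition(DELIMITER)
    if not sep:
        return term, None  # no delimiter, so no payload
    try:
        return term, float(raw)
    except ValueError:
        return term, None  # the remainder is not a valid number


# A well-behaved tag survives the round trip.
clean = decode_delimited(encode_delimited("jazz", 0.9))

# A user-generated tag containing the delimiter loses part of the term
# and its payload.
broken = decode_delimited(encode_delimited("rock|roll", 0.8))

# Structured alternative: term and payload travel as separate JSON values.
structured = json.loads(json.dumps({"term": "rock|roll", "payload": 0.8}))

print(clean)
print(broken)
print(structured)
```

Escaping the delimiter in user input would also work, but it pushes the problem onto every client that indexes documents.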

Contributor

brwe commented Sep 30, 2013

I think there are (at least) three issues here:
Taking the payloads into account for scoring could indeed be useful. I will try to come up with something. However, I would like to know how you believe the payload should affect the score. Since the same token can have different payloads, would you have an average of these numbers, the max, the min,...?

As for how to get the payloads in, I believe this is a different issue. It would be easy to expose the DelimitedPayloadTokenFilter in elasticsearch but passing the payloads in while indexing the document might be more tricky. If you desperately need that, could you open a new issue for that?

I do not fully understand how scripting support for payloads should work. Can you elaborate on this a bit or come up with an example?
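
The question of how several payloads for the same token should be combined can be made concrete with a small Python sketch. The payload values and the `payload_boost` helper are illustrative assumptions, not part of any Elasticsearch or Lucene API. The sketch only shows that average, max, and min can give quite different boosts for the same term.

```python
from statistics import mean

# Payloads per term for a single document (illustrative values). The term
# "quick" occurs twice with different payloads.
payloads = {
    "the": [1.0],
    "quick": [2.0, 10.0],
    "brown": [3.0],
    "fox": [4.0],
}

AGGREGATORS = {"avg": mean, "max": max, "min": min}


def payload_boost(term, how="avg", default=1.0):
    """Combine all payloads of a term into one boost value."""
    values = payloads.get(term)
    if not values:
        return default  # term absent: fall back to a neutral boost
    return AGGREGATORS[how](values)


for how in ("avg", "max", "min"):
    print(how, payload_boost("quick", how))
```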

Author

bdurand commented Oct 9, 2013

In my mind the scripting and scoring are tied together simply because I believe this is the kind of issue where you'd need to play with the data after indexing it to get the right scoring. Since the payload would need to be added at indexing time, it would be much easier if the "payload score" could be exposed to a scoring script used for ordering.

My particular use case in detail would be:

  1. While indexing documents, count the number of times each distinct tag has been applied by users to the document. This value would be included with each tag indexed for the document to indicate the weight for that particular tag.
  2. When searching, we would apply the previously defined weights for the tags (terms) in a custom scoring script.

We haven't worked out the actual algorithms yet and it is definitely something we'd need to play around with to get the right values. I would imagine the payload, though, would likely be a number between 0.0 and 1.0 indicating the confidence that the term was an accurate one.
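
The two steps above can be sketched in Python. This is only an illustration under stated assumptions: the thread says the actual algorithm has not been worked out, so the normalization (dividing by the count of the most frequent tag) and the helper names `tag_weights` and `score` are hypothetical choices made here to get a value between 0.0 and 1.0.

```python
from collections import Counter


def tag_weights(applied_tags):
    """Step 1: count how often each tag was applied and scale to (0.0, 1.0]."""
    counts = Counter(applied_tags)
    if not counts:
        return {}
    top = max(counts.values())
    return {tag: count / top for tag, count in counts.items()}


def score(query_terms, weights):
    """Step 2: sum the stored weights of the query terms found on the document."""
    return sum(weights.get(term, 0.0) for term in query_terms)


# Tags applied by users to one document (made-up data).
weights = tag_weights(["python", "python", "search", "python", "spam", "search"])

print(weights)
print(score(["python", "search"], weights))
```

Because the weights are stored once at indexing time, only `score` would need to change while experimenting with different scoring formulas.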

Contributor

brwe commented Oct 24, 2013

Sorry for the late reply:
I agree that having the payloads available in a script would indeed be helpful for evaluating different scoring functions. But be warned: it will be very slow and only good for prototyping.

I think the easiest way to do this is to simply make all term information for a document available in scripts. It would be similar to the term vector api. You would then have the freedom to choose any kind of document features for scoring.
What do you think?

Contributor

brwe commented Nov 13, 2013

I made a pull request (#4161) that allows accessing payloads, among other term information, in a script. If you are still interested, take a look and see if this is useful for you!

Author

bdurand commented Nov 13, 2013

Looks great! Thank you.

brwe added a commit to brwe/elasticsearch that referenced this issue Dec 18, 2013
term statistics can be accessed via the _shard variable.

Below is a minimal example. See the documentation for details.

```

DELETE paytest

PUT paytest
{
    "mappings": {
        "test": {
            "_all": {
                "auto_boost": true,
                "enabled": true
            },
            "properties": {
                "text": {
                    "index_analyzer": "fulltext_analyzer",
                    "store": "yes",
                    "type": "string"
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "fulltext_analyzer": {
                    "filter": [
                        "my_delimited_payload_filter"
                    ],
                    "tokenizer": "whitespace",
                    "type": "custom"
                }
            },
            "filter": {
                "my_delimited_payload_filter": {
                    "delimiter": "+",
                    "encoding": "float",
                    "type": "delimited_payload_filter"
                }
            }
        },
        "index": {
            "number_of_replicas": 0,
            "number_of_shards": 1
        }
    }
}

POST paytest/test/1
{
    "text": "the+1 quick+2 brown+3 fox+4 is quick+10"
}

POST paytest/test/2
{
    "text": "the+1 quick+2 red+3 fox+4"
}

POST paytest/_refresh

POST paytest/_search
{
    "script_fields": {
       "ttf": {
          "script": "_shard[\"text\"][\"quick\"].ttf()"
       }
    }
}

POST paytest/_search
{
    "script_fields": {
       "freq": {
          "script": "_shard[\"text\"][\"quick\"].freq()"
       }
    }
}

POST paytest/test/2/_termvector

POST paytest/_search
{
    "script_fields": {
       "payloads": {
          "script": "term = _shard[\"text\"].get(\"red\",_PAYLOADS);payloads = []; for(pos : term){payloads.add(pos.payloadAsFloat(-1));} return payloads;"
       }
    }
}

POST paytest/_search
{
   "script_fields": {
      "tv": {
         "script": "_shard[\"text\"][\"quick\"].freq()"
      }
   },
   "query": {
      "function_score": {
         "functions": [
            {
               "script_score": {
                  "script": "_shard[\"text\"][\"quick\"].freq()"
               }
            }
         ]
      }
   }
}

```

closes elastic#3772
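
To make the example requests easier to follow, here is a plain-Python simulation of the two sample documents. It does not call Elasticsearch. It assumes whitespace tokenization, a split at the first `+`, a single shard (so total term frequency is just the sum over both documents), and that `payloadAsFloat(-1)` falls back to `-1` when a position has no payload. The helpers `analyze`, `freq`, `ttf`, and `term_payloads` are hypothetical names for this sketch, and the printed values are what one would expect under these assumptions, not verified server output.

```python
DOCS = {
    1: "the+1 quick+2 brown+3 fox+4 is quick+10",
    2: "the+1 quick+2 red+3 fox+4",
}


def analyze(text, delimiter="+"):
    """Whitespace tokenization, then split each token at the first delimiter."""
    tokens = []
    for raw in text.split():
        term, sep, payload = raw.partition(delimiter)
        tokens.append((term, float(payload) if sep else None))
    return tokens


# Per-document list of (term, payload) pairs, one entry per position.
INDEX = {doc_id: analyze(text) for doc_id, text in DOCS.items()}


def freq(doc_id, term):
    """Number of occurrences of the term in one document."""
    return sum(1 for t, _ in INDEX[doc_id] if t == term)


def ttf(term):
    """Total occurrences of the term across all documents in the simulation."""
    return sum(freq(doc_id, term) for doc_id in INDEX)


def term_payloads(doc_id, term, default=-1.0):
    """Payload at each position of the term, or the default if none is stored."""
    return [default if p is None else p for t, p in INDEX[doc_id] if t == term]


print(ttf("quick"))
print(freq(1, "quick"), freq(2, "quick"))
print(term_payloads(1, "red"), term_payloads(2, "red"))
print(term_payloads(1, "quick"))
```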
brwe added a commit that referenced this issue Jan 2, 2014
term statistics can be accessed via the _shard variable.

closes #3772
@brwe brwe closed this as completed in 1ede9a5 Jan 2, 2014
brusic pushed a commit to brusic/elasticsearch that referenced this issue Jan 19, 2014
term statistics can be accessed via the _shard variable.

closes elastic#3772
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
term statistics can be accessed via the _shard variable.

closes elastic#3772